ABSTRACT

Online service platforms (OSPs), such as search engines,
news websites, ad providers, etc., serve highly personalized
content to the user, based on the profile extracted from his
history with the OSP. Although personalization (generally)
leads to a better user experience, it also raises privacy
concerns for the user: he does not know what is present in his
profile and, more importantly, what is being used to
personalize content for him. In this paper, we capture an OSP's
personalization for a user in a new data structure called
the personalization vector ($\eta$), which is a weighted vector
over a set of topics, and present techniques to compute it
for users of an OSP.

Our approach treats OSPs as black boxes, and extracts
$\eta$ by mining only their output, specifically, the personalized
(for a user) and vanilla (without any user information)
contents served, and the differences between them. We
believe that such treatment of OSPs is a unique aspect of
our work, not just enabling access to (so far hidden) profiles
in OSPs, but also providing a novel and practical approach
for retrieving information from OSPs by mining differences
in their outputs.

We formulate a new model called Latent Topic
Personalization (LTP) that captures the personalization vector in
a learning framework, and present efficient inference
algorithms for it. We perform extensive experiments on search result
personalization using both data from real Google users and
synthetic datasets. Our results show high accuracy (R-pre
= 84%) of LTP in finding personalized topics. For Google
data, our qualitative results show how LTP can also
identify evidence, i.e. queries whose results on a topic with
a high $\eta$ value were re-ranked. Finally, we show how our approach
can be used to build a new privacy evaluation framework
focused on end-user privacy on commercial OSPs.

1. INTRODUCTION 

Personalization is being used by most online service
platforms (OSPs) such as search, advertising, shopping, etc. The
goal is to lure users by offering a better service experience
customized to their individual interests. A popular trend is
to employ profile-based personalization, where OSPs build
an extensive profile for the user (based on his past interactions:
search queries, browsing history, links shared, etc.) and
personalize the content based on this profile. Several popular
services employ such personalization, e.g. search^1, movie
recommendation^2, etc.

While OSPs definitely track rich user histories, they can
infer a great deal more by mining this raw data. Informally
speaking, OSPs can determine users' interests and biases on
different categories, which can then be used (along with their
history) for personalization. For example (see [19] for
details), Google is shown to have inferred users' political
affiliations (Republican or Democratic), and used them to re-rank
results.

For a user, this raises a significant privacy concern: he
does not know what was tracked in his history, what has
been inferred, and more importantly, what is currently being used
to personalize content. Moreover, as both the personalization
techniques and the data they operate on are the key
differentiators of these OSPs (their secret sauce), they do
not reveal either of them, making it even harder for a user
to understand how personalization is done for him.

In this paper, we aim at extracting a user's profile from
the OSP. We model a user's profile as a weighted
personalization vector over topics, where the weight on a topic
indicates his interest in it (higher means more interested).
Informally, a topic is any concept or phenomenon that the
user could be interested in, e.g. a specific sport, a preference
over cuisine, favorite author, movie genre, etc.^3

Our goal is not to reverse-engineer the OSP's inference
algorithms. In fact, we treat the OSP as a black box. We
assume that we only have access to its output, which is
basically the (personalized) content served by it to the
user on different urls.^4 The key idea is to get both
personalized content (served for the user) and vanilla content (served
for a new/not logged-in user) for the same url from the OSP,
and determine the topics of personalization based on these two
contents and the differences between them.

^1 http://privacy.microsoft.com/en-us/Bing.mspx;
http://support.google.com/websearch/bin/answer.py?answer=1710607.
^2 Netflix: http://www.netflixprize.com

^3 More specifically, we define it as a distribution over a bag
of words, as is common in the topic modeling literature (see
Section 3 for details).

^4 The url could point to a static page, e.g. reviews and
other information on a movie, or to a dynamically generated
one, e.g. search results.



The profiles in OSPs have remained opaque so far, with
little knowledge of the user profiles hidden in them. Our
paper provides a novel approach to crack this problem, giving
insights into the user profiles without knowledge of the
inference techniques or the history of the user. We believe
that this aspect of comparing the differences in output to
extract the hidden personalized topics is unique to our
paper and opens a new direction in privacy research that can
be aimed at commercial OSPs.

Let us consider the case of a search engine. For any query, 
we can get the personalized and vanilla results by making the 
query from a browser with and without logging in, respec- 
tively. These results are basically two ranked lists with some 
urls in the latter moved up or down in the former, based on 
the user profile. We study these movements over multiple 
queries and determine the most likely topics of interest for 
the user that can best explain them. 
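
As a minimal sketch of the comparison described above, the following Python snippet computes how far each url moved between a vanilla and a personalized result list. The function name `rank_shifts` and the toy urls are illustrative assumptions, not part of our system.

```python
def rank_shifts(vanilla, personalized):
    """Return {url: vanilla_rank - personalized_rank} for urls in both lists.

    Ranks are 1-based. A positive value means the url moved up (towards
    rank 1) in the personalized list; a negative value means it moved down.
    """
    vanilla_rank = {url: r for r, url in enumerate(vanilla, start=1)}
    shifts = {}
    for r, url in enumerate(personalized, start=1):
        if url in vanilla_rank:
            shifts[url] = vanilla_rank[url] - r
    return shifts


# Toy example with hypothetical urls.
vanilla = ["a.com", "b.com", "c.com", "d.com"]
personalized = ["c.com", "a.com", "b.com", "d.com"]
shifts = rank_shifts(vanilla, personalized)
```

Here `c.com` moves up by two positions while `a.com` and `b.com` each move down by one; such movements, aggregated over many queries, are the raw signal that the rest of the paper models.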

For the remainder of this paper we will talk only about
search result personalization. However, our techniques can
be easily extended to any service where a) we can observe
both vanilla and personalized content and b) we can get a
ranked ordering of the content. For example, we can
apply it to movie recommendation (in, say, Netflix) based on
the personalized (and vanilla) ranked list of related movies
presented on the web-page of a particular movie.

1.1 Search Personalization and Re-ranking 

Although the exact details of personalization for many
popular services of today's web remain a mystery, recent
works in the web-search community have thrown some light
on the intricacies of search engine personalization [sj [22]
[17]. These techniques vary considerably in terms of their
description and complexity, but the common underlying theme
for them is to first populate the vanilla result using the
semantics of the query string and then personalize it by
re-arranging the items in this list, using the profile
information. Therefore, conceptually, the vanilla and personalized
responses are re-orderings of the same set of items. We take
advantage of this re-ranking of results to determine the
topics present in the user's profile with the OSP.

The restriction of re-ranking over the same urls is useful
for the exposition of our solution approach, but can be easily
lifted by simply adding the extra urls in one list to the end of
the other list.^5 The important point is that personalization,
by definition, will affect the ranks of the results shown, which is
what we use in this paper.
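
The padding step mentioned above can be sketched as follows. This is an illustrative helper only; the name `align_lists` and the choice to keep the extra urls in their original relative order are assumptions.

```python
def align_lists(vanilla, personalized):
    """Pad both ranked lists so that they contain the same set of urls.

    Urls that appear in only one list are appended, in their original
    order, to the end of the other list.
    """
    vanilla_set, personalized_set = set(vanilla), set(personalized)
    extra_in_personalized = [u for u in personalized if u not in vanilla_set]
    extra_in_vanilla = [u for u in vanilla if u not in personalized_set]
    return vanilla + extra_in_personalized, personalized + extra_in_vanilla


# "x" appears only in the personalized list, "c" only in the vanilla list.
sigma, pi = align_lists(["a", "b", "c"], ["b", "x", "a"])
```

After this step both lists are orderings of the same items, so they can be treated as permutations of one another.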

Note that these topics may not be explicitly maintained at
the OSP; in fact it could be using something completely
unrelated to our definition of topics to model the user
profile. Our paper hinges on the intuition that a user's
interests with most OSPs can be captured by a set of topics
that he is interested in, and that any OSP that personalizes
results based on his interests must give higher preference to
results matching these topics. Thus our approach of
finding topic-level personalization is fairly generic: it works on
OSPs that do not necessarily have topic-based profiles of
users, and without knowledge of the profiling algorithms
they use.

An alternative, competing approach to recreate the user
profile could be to mine the input to the OSP (i.e. the user's
history) [sj [22] [21]. However, this approach has several
shortcomings compared to ours. One, it is very hard to catch up
with the commercial techniques used by OSPs, which are usually
more advanced and rapidly evolving. Two, due to the
proprietary nature of OSPs, it is not clear what algorithm or even
what part of the history is being used by them (e.g. most
personalization is done using only recent history, but it is
not clear how recent for each user). In other words,
with any profiling tool, there is no certainty that it can infer
all that the OSP has. Finally, in many cases the history
information may not be available publicly (e.g. while a Google
user's search history is available, past ads served are not),
limiting the effectiveness of these approaches. In contrast,
our approach is agnostic to the OSP's personalization scheme
and can work even when the history is not public.

^5 In our experiments with Google, only 15% of
personalized results contain any extra result compared to vanilla,
and even these contain on average only 14% extra urls (or
1.4 urls for an avg. result size of 10).

1.2 A new privacy preserving framework 

The topics of personalization for a user can be utilized in
building a novel privacy evaluation and prevention toolkit,
which we describe next. The toolkit presents the user with the
topics his profile is personalized on and asks him to
determine, based on his personal judgment, whether some of these
topics are sensitive.^6 Now, a topic which is both sensitive
and has a high personalization score can be deemed a privacy
leak, as the OSP is using the user's data in a way he does
not agree with. These leaks for a user can now be detected
and monitored over time, and in several cases can also be
plugged, e.g. by undoing the re-ranking on sensitive
topics or simply serving the vanilla content. These ideas are
currently being developed into a privacy preserving toolkit
in which the techniques developed in this paper are a key
component.
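
The leak criterion above (sensitive and highly personalized) can be illustrated with a small sketch. The function name `find_privacy_leaks`, the explicit `threshold` parameter and the toy topic names are assumptions made for illustration; the toolkit itself is not specified at this level of detail here.

```python
def find_privacy_leaks(eta, sensitive_topics, threshold):
    """Return (topic, weight) pairs that are both user-flagged as sensitive
    and have a personalization weight above `threshold`, highest first."""
    leaks = [(topic, w) for topic, w in eta.items()
             if topic in sensitive_topics and w > threshold]
    return sorted(leaks, key=lambda tw: tw[1], reverse=True)


# Hypothetical personalization vector and user-chosen sensitive topics.
eta = {"sports": 1.9, "health": 1.2, "finance": 0.1, "movies": 0.8}
leaks = find_privacy_leaks(eta, {"health", "finance"}, threshold=0.5)
```

In this toy example only `health` is reported: `finance` is sensitive but barely personalized, and `sports` is heavily personalized but not sensitive.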

1.3 Our Contributions 

The main contributions of the paper are as follows. 

• We propose a new direction in privacy research that
enables users to get a glimpse of the profile
information being used by commercial OSPs to serve
personalized content. We formally capture this information
as a topic-level personalization vector that provides a
concise and accurate summary of the user profile.

• We propose a novel way to compute this topic-level 
personalization based on the personalized and vanilla 
content served by OSPs. This formulation treats the 
services as a black box and hence can work with a vari- 
ety of online services. We believe that this is a unique 
aspect of our work and can open a new direction for 
privacy research by enabling access to (so far hidden) 
profile information in OSPs. 

• We present a probabilistic model (named Latent Topic 
Personalization, or LTP) that captures the intuition 
behind our approach. LTP is both expressive and 
leads to computationally efficient inference algorithms 
(LTP-INF and LTP-EM) that find the personalization 
vector on real datasets. 

• Our experiments with a synthetic dataset using a
state-of-the-art personalization engine show that LTP can
learn the personalization parameters very accurately,
getting on average 84% precision in learning
personalized topics.

• We perform experiments on a novel real-life dataset
containing the personalized and vanilla query results
collected from 10 Google users. We also demonstrate
how our techniques can be used to find the evidence
of personalization, which can be very helpful in
user-facing tools (see Section 1.2).

^6 The definition of sensitive, informally stated as any topic
he finds uncomfortable getting personalization on, can vary
across users and could include health conditions, finances,
sexual preferences, etc.



2. RELATED WORK 

Search personalization: A large body of work exists on
personalizing search results using user profiles [sj [22]
that collectively give overwhelming evidence of its
benefits. More recently, researchers have also explored creating
profiles using topic models [21] and other textual
information [23]. These works are not competitors of our paper,
but rather serve as a motivation for us, as they highlight
the existence and importance of profiles in the state of the art
in personalization.

Another body of work explores short-term and
session-based personalization [1] [8], which personalizes based on
the user's current intention, inferred from his recent history or
session. While such an approach is not aligned with our idea, there
are two important points to note: a) they do not imply that
profile-based personalization does not happen; rather, the two are
typically used in conjunction with each other [1] [Ts], and b)
since they are applicable only during a session, it is easy
to remove their effect by making sure no coherent session
is tracked during our data collection (by issuing queries
randomly and multiple times while matching results).

Researchers have also found that personalization is not
always beneficial and have proposed various approaches, such
as click-entropy [24] [26], dynamic user interests [Ts] and
query difficulty [30], to filter queries that should not be
personalized (irrespective of the user's profile). Such filtering is
very hard to replicate in our approach since the output may not
contain any information to model it. We therefore allow for the
existence of this hidden process in our model via a latent
variable deciding (randomly) if personalization happens on
a query (see Section 4.1 for details).

Topic Models: Although topic models are clearly a
popular tool for processing textual information and have
also been used in personalization, there is no work to our
knowledge that models the differences between two documents (or
two ranked sets of documents) as we do. A recent work by Bischof
et al. [2] comes close: they find exclusive topics (that are
sufficiently different from each other) so that the documents
can be classified into a non-overlapping hierarchy. While this
also involves finding topics which are present in some
documents and not in others, it is still very different from our
approach of finding a consistent (possibly non-exclusive) set
of personalized topics that can differentiate personalized and
vanilla content.

Privacy: Finally, our problem stems from the general area
of user privacy. Various studies have highlighted problems of
privacy in information leaks from OSPs [11] [16] [10]. Korolova
et al. [10] showed how targeted ads can pin-point individual
users in Facebook, and Mao et al. analyzed tweets to find
vacation plans, medical conditions etc. for real users.
However, these studies are focused on finding instances of privacy
leaks from the entire OSP network and do not help users
understand leaks in their own account. Other approaches to
privacy preserving personalization aim at building a system
from scratch that ensures certain norms are preserved
in the personalized output, e.g. grouping user profiles [28]
[29] to preserve k-anonymity or making a differentially
private recommender system [14]. Recently, Chen et al. [6]
presented a more user-centric approach that gives the user control
over the fine-grained categories (represented as a fixed
hierarchical taxonomy) on which they want personalization. These
techniques however require users to switch to these new
systems from their existing OSPs, which is not practical, while
we aim at finding personalization in existing OSPs.

3. PROBLEM FORMULATION

In this section, we introduce our notation and define the
technical problem that we consider in this paper.

3.1 Notation

Let $I = \{i_1, i_2, \dots\}$ be the universe of all the items
present at the personalization server, where an item might
represent a url (for search engines like Google, Bing etc.), a
product web-page (for e-commerce sites like Amazon,
Netflix etc.) or an advertisement (for ad servers). For a query
$q$, let $\pi_q$ and $\sigma_q$ denote the personalized and vanilla
lists of content. In the following discussion, we will often drop the
subscript $q$ when the query is understood from the context.

As mentioned earlier, both $\pi$ and $\sigma$ are treated as
permutations over a set of items^7 $I' \subset I$. Technically, a
ranking/permutation^8 is a bijection from a set to itself. For any
permutation $\pi$, $\pi(i)$ denotes the item assigned to rank $i$,
hence $\pi = (\pi(1), \pi(2), \dots)$. $\pi^{-1}(d)$ denotes the rank
$i$ of an item $d \in I$ in $\pi$ such that $\pi(i) = d$. For any two
permutations $\pi$ and $\sigma$, we use the notation
$\sigma^{-1}(\pi(i))$ to denote the rank of the item $\pi(i)$ in
$\sigma$. Observe that $\pi^{-1}(\pi(i)) = i$. We use $S_n$ to denote
the set of all permutations of $n$ items.

We assume that there are $T$ topics $\{\beta_1, \beta_2, \dots, \beta_T\}$
in our system, where each topic $\beta_k$ is defined as a
multinomial distribution over a fixed vocabulary $V$. For each word
$w \in V$, we have a parameter $\beta_{k,w} = \Pr(w \mid \beta_k)$
such that $\sum_{w \in V} \beta_{k,w} = 1$. Each item^9 $i \in I$ is
represented by its topic-map $\theta_i$, which is a multinomial
distribution over the set of topics. By inspecting each component
of $\theta_i$, one can infer how related the item is to a particular
topic.

We now describe our representation of topic-level user
profile information. For each user $u$ and topic $\beta_k \in \beta$,
we associate a variable $\eta_{u,k} \in \mathbb{R}$. It captures the
importance of $\beta_k$ (more relevant topics have higher values) for
serving personalized content to $u$. The complete profile information
(which we call the latent personalization vector) is denoted by
$\eta_u = (\eta_{u,1}, \eta_{u,2}, \dots, \eta_{u,T})$. We often drop
the subscript $u$ and refer to it simply as $\eta$ whenever the user
is understood from the context.



^7 Strictly speaking, they may not contain exactly the same set
of items, but this is normally the case. E.g. in our experiments
with Google, personalized results are identical to vanilla for
85% of queries and contain on avg. only 14% extra items on
the remaining ones. These extra items can be handled easily by
adding them to the end of the personalized (or vanilla) list.

^8 We often use them interchangeably.

^9 Specifically, the textual content or meta-data of the
item.
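
To make the permutation notation of Section 3.1 concrete, the following sketch represents a ranking as a Python list (index 0 holds rank 1) and its inverse as a dictionary from item to rank. The helper name `inverse` and the toy items are illustrative assumptions.

```python
def inverse(perm):
    """Map each item to its 1-based rank: inverse(perm)[d] is perm^{-1}(d)."""
    return {item: rank for rank, item in enumerate(perm, start=1)}


pi = ["c", "a", "b"]      # personalized: pi(1) = "c", pi(2) = "a", pi(3) = "b"
sigma = ["a", "b", "c"]   # vanilla:      sigma(1) = "a", sigma(2) = "b", ...

pi_inv, sigma_inv = inverse(pi), inverse(sigma)

# sigma^{-1}(pi(i)): vanilla rank of the item shown at personalized rank i
vanilla_ranks = [sigma_inv[pi[i - 1]] for i in range(1, len(pi) + 1)]
```

For this example `vanilla_ranks` is `[3, 1, 2]`: the item at personalized rank 1 came from vanilla rank 3, and so on.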



3.2 Problem 

Our strategy to learn the personalization vector $\eta$ is to
repeatedly issue queries to the server and observe the
difference between its vanilla and personalized responses. For
a given user $u$, we first sign in to her account and submit a
query to the server. This gives the server an opportunity to
personalize the result by using $u$'s profile information and
through this process, we obtain the personalized response $\pi$.
Next, we submit the same query in an anonymized form, by
removing all cookies from the http request, thus removing
all account details (but keeping all other parameters the same,
such as IP address, User-Agent, etc.). This time the server
sends back the vanilla response $\sigma$. We expect that as this
process is repeated many times, the cumulative difference
between these two responses will become statistically
significant and contain substantial evidence of $\eta$. In this paper,
we study the following problem: Given pairs of query results
$(\sigma_1, \pi_1), (\sigma_2, \pi_2), \dots, (\sigma_m, \pi_m)$,
how do we learn the latent personalization vector $\eta$ for a
given user?

Non-profile factors: Although personalization normally
yields its benefits by presenting more relevant results to the
users, it is also known to be less effective and even
detrimental in many cases. For example, while personalizing results
is known to work well for short and ambiguous queries [25],
where users searching for the same query may be looking for
completely different things, for common and specific queries two
users with very different profiles are normally looking for the
same information and are satisfied with the same (ordering
of) results. In such cases, even though the user's profile implies
re-ranking, the server may decide not to personalize. This
creates a problem for our approach, as a search engine's
decision whether or not to personalize the result of a search query
is influenced not only by the topical content of the
query result, but also by other filtering processes that
are hidden from us.

We take care of this in our model by introducing a
latent parameter that, during the training phase, filters out such
inexplicable events and reduces the noise in the
personalization vector. In our experiments with the Google dataset,
we found several instances of queries with results at higher
ranks having higher "scores" (see Section 4 for the definition of
scores) than the ones at lower ranks, that were not personalized,
while another query with similar scores was personalized.
Without this latent parameter, these instances would have
reduced the effectiveness of learning $\eta$.

4. LTP MODEL 

The goal of topic-based personalization learning is to
capture the following information: the topics on which
personalization takes place and a weight vector corresponding to the
degree of personalization on these topics. In addition, the
approach has to scale to a large number of queries. To meet
these objectives, we first propose the Latent Topic
Personalization model (LTP) to study the problem from a Bayesian
perspective. Following that, we develop efficient variational
inference and estimation techniques for learning the
parameters of this model.

4.1 Model Description 

We now formally describe the proposed LTP model. LTP
models (Figure 1) both topics and personalization. It
involves a topic block to model the topical content creation of
the items and a personalization block to model the
personalized responses (i.e. $\pi_1, \pi_2, \dots, \pi_m$).

Figure 1: Graphical model representation of LTP (panels:
Personalization Block, Topic Block).

Topic Block: The topic block follows the description of
standard topic models (cf. LDA [s]) and we present it here
for the sake of completeness. The generative process for the
topic block is as follows:

• For each topic $\beta_k$, $k = 1, 2, \dots, T$

  1. Sample $\beta_k \sim \mathrm{Dirichlet}(\nu)$.

• For each item $i \in I$

  1. Sample its topic-map $\theta_i \sim \mathrm{Gaussian}(0, \mathrm{diag}(\alpha^2))$.

  2. For each word position $j = 1, \dots, n_i$ for item $i$

     (a) Sample topic $K_{i,j}$ with $\Pr(K_{i,j} = k) \propto e^{\theta_{i,k}}$.

     (b) Sample word $W_{i,j} \sim \mathrm{Multinomial}(\beta_{K_{i,j}})$.
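
The generative process of the topic block can be simulated with a short sketch using only the Python standard library. The function names, the toy vocabulary, and the use of a single scalar for the Dirichlet parameter and for the Gaussian standard deviation are assumptions made for illustration.

```python
import math
import random


def sample_dirichlet(rng, nu, size):
    """Symmetric Dirichlet(nu) sample via normalized Gamma draws."""
    draws = [rng.gammavariate(nu, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_block(rng, vocab, num_topics, item_lengths, nu=0.5, alpha=1.0):
    """Sample topics, topic-maps and words following the process above."""
    # One word distribution per topic: beta_k ~ Dirichlet(nu)
    beta = [sample_dirichlet(rng, nu, len(vocab)) for _ in range(num_topics)]
    theta, words = [], []
    for n_i in item_lengths:
        # theta_i ~ Gaussian(0, alpha^2 I) (scalar alpha is an assumption)
        theta_i = [rng.gauss(0.0, alpha) for _ in range(num_topics)]
        # Pr(K_ij = k) is proportional to exp(theta_ik)
        topic_probs = softmax(theta_i)
        item_words = []
        for _ in range(n_i):
            k = rng.choices(range(num_topics), weights=topic_probs)[0]
            item_words.append(rng.choices(vocab, weights=beta[k])[0])
        theta.append(theta_i)
        words.append(item_words)
    return beta, theta, words


rng = random.Random(0)
vocab = ["goal", "match", "recipe", "oven"]
beta, theta, words = sample_topic_block(rng, vocab, num_topics=2, item_lengths=[5, 3])
```

Each entry of `words` is the bag of words of one item; `theta` holds the (unnormalized) topic-maps from which the per-word topics were drawn.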

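The joint distribution for the topic block can be written
as

$$p(\theta, K, W, \beta \mid \alpha, \nu) = \prod_{i \in I} p(\theta_i \mid \alpha) \cdot \prod_{k=1}^{T} p(\beta_k \mid \nu) \cdot \prod_{i \in I} \prod_{j=1}^{n_i} p(K_{i,j} \mid \theta_i) \cdot p(W_{i,j} \mid K_{i,j}, \beta_{1 \dots T}) \qquad (1)$$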

Personalization Block: Our design of the
personalization block is a little more involved. The main difficulty stems
from the non-profile-based factors, which may lead to no
re-ranking of results even when the user profile (i.e. $\eta$)
indicates that personalization should happen. In LTP, we handle this
by introducing a latent switch variable $z$ (refer to Figure 1).
Independently, for each query, we sample $z$, governed by a
prior parameter $\tau$, and based on its value decide whether
to allow topical personalization or not. The parameter $\tau$ is
user-specific and controls the rate at which topical
personalization takes place (for that user).

Based on the value of $z$, we pick a probability
distribution over permutations and sample $\pi$ from it. Probabilistic
models on permutations have recently been applied to solve
various problems related to ranking [20]. Probability
distributions defined over permutations can be broadly
categorized into two types: distance based and score based. In a
distance based model [15], the probability of a permutation
is defined according to its distance from a central
permutation. Such models have rich expressive power as they can
incorporate a wide variety of distance functions over permutations
but are, in general, computationally inefficient.

Figure 2: An example illustrating the steps of $f$. We
have assumed $\mu = 1$. At each stage, the actual
outcome is marked in blue and the most likely outcome
is marked in red.

Score based models [12], on the other hand, are very
efficient as they divide permutation construction into stages
and assign scores at each stage such that the final
probability is a combination (multiplication) of stage-wise scores.
However, being defined as a specific function over scores,
they have limited expressive power, e.g. they cannot take
into account any central permutation in the generative
process. For LTP, we have a central permutation (the vanilla list
$\sigma$) and want to model $\pi$ as being generated from it. Further,
as explained later, we define scores on items as a function of
$\eta$. Therefore, we need a model which combines the notion
of distance with scores and is computationally efficient.

The probability distribution $f$ (Figure 1) is a process for
generating the personalized response $\pi$, and is decomposed
into sequential stages. Observe that (see Figure 1) this
process is activated only if $z = 0$, i.e. when no topical
personalization should happen. In the first stage, we pick
the item $\pi(1)$ with probability

$$\frac{\exp(\mu(1 - \sigma^{-1}(\pi(1))))}{\sum_{j \geq 1} \exp(\mu(1 - \sigma^{-1}(\pi(j))))}.$$

Note that this probability is maximum when the two
permutations agree on the first position, i.e. $\pi(1) = \sigma(1)$.
However, if we happen to pick some other item, i.e.
$\pi(1) \neq \sigma(1)$, then for the second stage, the most likely
outcome is to bring back the item $\sigma(1)$ and put it at the
second position of $\pi$, i.e. $\pi(2) = \sigma(1)$.

In general, in the $k$-th stage, the probability of selecting
$\pi(k)$ is

$$\frac{\exp(\mu(k - \sigma^{-1}(\pi(k))))}{\sum_{j \geq k} \exp(\mu(k - \sigma^{-1}(\pi(j))))}.$$

Intuitively, at each stage $k$, the model determines the items
among $\sigma(1), \sigma(2), \dots, \sigma(k-1)$ which are not yet
sampled by $f$ and assigns higher probability to picking them.
Figure 2 gives an example of this sampling process.

Considering all the stages, we obtain the overall
probability of sampling $\pi$, which is given by the following expression:

$$f(\pi \mid \sigma, \mu) = \prod_{i=1}^{n} \frac{\exp(\mu(i - \sigma^{-1}(\pi(i))))}{\sum_{j \geq i} \exp(\mu(i - \sigma^{-1}(\pi(j))))} \qquad (2)$$



It can be shown that $f$ is a valid probability distribution, i.e.
$f(\pi \mid \sigma, \mu) > 0$ for all $\pi \in S_n$ and
$\sum_{\pi \in S_n} f(\pi \mid \sigma, \mu) = 1$. The
parameter $\mu$ controls the spread of the distribution, i.e. if
$\mu \to 0$ then $f$ converges to the uniform distribution over $S_n$;
otherwise, for $\mu > 0$ the distribution is concentrated around
$\sigma$. We assume $\mu > 1$.
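
The stage-wise definition of $f$ can be checked numerically on a small example. The sketch below evaluates Equation 2 directly and enumerates all permutations of four items; the function name `f_prob` and the value of $\mu$ are illustrative choices.

```python
import itertools
import math


def f_prob(pi, sigma, mu):
    """Probability of permutation `pi` under f(. | sigma, mu) (Equation 2)."""
    sigma_rank = {item: r for r, item in enumerate(sigma, start=1)}
    prob = 1.0
    for i in range(1, len(pi) + 1):
        # Items pi(i), pi(i+1), ... are the ones still available at stage i.
        weights = [math.exp(mu * (i - sigma_rank[d])) for d in pi[i - 1:]]
        prob *= weights[0] / sum(weights)
    return prob


sigma = ("a", "b", "c", "d")
all_perms = list(itertools.permutations(sigma))
total = sum(f_prob(p, sigma, mu=1.5) for p in all_perms)
most_likely = max(all_perms, key=lambda p: f_prob(p, sigma, mu=1.5))
uniform_case = f_prob(("d", "c", "b", "a"), sigma, mu=0.0)
```

On this example the probabilities sum to one, the most likely permutation is $\sigma$ itself, and setting $\mu = 0$ gives every permutation probability $1/4!$, in line with the properties stated above.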

We now describe our next permutation model $g$, which
captures the topic-level personalization and is invoked only if
$z = 1$. Model $g$ is also decomposed into sequential stages
and at each stage uses both the central permutation $\sigma$ and
a set of scores to determine $\pi$. Each item $d \in I$ is assigned
a score $\eta^T \theta_d$. In the $i$-th stage, $g$ selects the item
$\pi(i)$ with probability

$$\frac{\exp(\lambda \eta^T \theta_{\pi(i)} + (1 - \lambda)(i - \sigma^{-1}(\pi(i))))}{\sum_{j \geq i} \exp(\lambda \eta^T \theta_{\pi(j)} + (1 - \lambda)(i - \sigma^{-1}(\pi(j))))}.$$

The working principle for $g$ is similar to $f$, except that
it now allows for deviations from $\sigma$ only if they are explained
by the scores. Parameter $\lambda$ is tuned to adjust the relative
importance of the scores and the central permutation $\sigma$. For
example, if $\lambda = 0$ then the scores are ignored and if
$\lambda = 1$ then the central permutation does not play any role. We
treat $0 < \lambda < 1$ as a free parameter whose value needs to be
learned from the data. The overall probability of sampling
$\pi$ is given by

$$g(\pi \mid \eta; \sigma, \lambda, \theta) = \prod_{i=1}^{n} \frac{\exp(\lambda \eta^T \theta_{\pi(i)} + (1 - \lambda)(i - \sigma^{-1}(\pi(i))))}{\sum_{j \geq i} \exp(\lambda \eta^T \theta_{\pi(j)} + (1 - \lambda)(i - \sigma^{-1}(\pi(j))))}$$

It can be verified that $g$ is also a valid probability
distribution.
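
The effect of $\lambda$ can be seen on a toy instance. The sketch below evaluates $g$ as defined above; the function name `g_prob`, the two-topic topic-maps and the value of $\eta$ are hypothetical and chosen only to make the two extremes visible.

```python
import itertools
import math


def g_prob(pi, sigma, eta, theta, lam):
    """Probability of `pi` under g(. | eta; sigma, lam, theta)."""
    sigma_rank = {item: r for r, item in enumerate(sigma, start=1)}
    # The score of item d is the inner product eta^T theta_d.
    score = {d: sum(e * t for e, t in zip(eta, theta[d])) for d in sigma}
    prob = 1.0
    for i in range(1, len(pi) + 1):
        weights = [math.exp(lam * score[d] + (1 - lam) * (i - sigma_rank[d]))
                   for d in pi[i - 1:]]
        prob *= weights[0] / sum(weights)
    return prob


sigma = ("a", "b", "c")
theta = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.1, 0.9]}  # toy topic-maps
eta = [0.0, 4.0]                                             # interest in topic 2
perms = list(itertools.permutations(sigma))

total = sum(g_prob(p, sigma, eta, theta, lam=0.8) for p in perms)
top_with_scores = max(perms, key=lambda p: g_prob(p, sigma, eta, theta, lam=1.0))
top_without_scores = max(perms, key=lambda p: g_prob(p, sigma, eta, theta, lam=0.0))
```

With $\lambda = 1$ the most likely ranking orders items purely by score (`c`, `b`, `a`), while with $\lambda = 0$ it coincides with the vanilla list $\sigma$.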

The generative process for the personalization block can
be described as

• For each user $u$

  1. Sample $\tau \sim \mathrm{Beta}(\delta, \delta)$.

  2. Sample $\eta \sim \mathrm{Gaussian}(0, \mathrm{diag}(\gamma^2))$.

• For each query $q_i$, $i = 1, 2, \dots, m$

  1. Sample $z_i \sim \mathrm{Bernoulli}(\tau)$ to decide whether to
     allow topical personalization.

  2. If $z_i = 1$, sample $\pi_i \sim g(\cdot \mid \sigma_i, \lambda, \theta, \eta)$.

  3. Else, sample $\pi_i \sim f(\cdot \mid \sigma_i, \mu)$.
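
The generative process above can be simulated directly, which is one way to produce synthetic personalized lists from vanilla ones. The following is a sketch under stated assumptions: the function names, the default hyperparameter values and the scalar $\gamma$ are illustrative, and the topic-maps `theta` are taken as given.

```python
import math
import random


def sample_staged(rng, items, weight):
    """Build a permutation stage by stage; weight(i, d) is the unnormalized
    log-weight of placing item d at rank i."""
    remaining, perm = list(items), []
    for i in range(1, len(items) + 1):
        w = [math.exp(weight(i, d)) for d in remaining]
        d = rng.choices(remaining, weights=w)[0]
        remaining.remove(d)
        perm.append(d)
    return perm


def sample_personalization_block(rng, vanilla_lists, theta, num_topics,
                                 delta=1.0, gamma=1.0, mu=1.5, lam=0.7):
    tau = rng.betavariate(delta, delta)                       # tau ~ Beta(delta, delta)
    eta = [rng.gauss(0.0, gamma) for _ in range(num_topics)]  # eta ~ Gaussian(0, gamma^2 I)
    results = []
    for sigma in vanilla_lists:
        rank = {d: r for r, d in enumerate(sigma, start=1)}
        z = 1 if rng.random() < tau else 0                    # z ~ Bernoulli(tau)
        if z == 1:   # topical personalization: sample from g
            def weight(i, d):
                score = sum(e * t for e, t in zip(eta, theta[d]))
                return lam * score + (1 - lam) * (i - rank[d])
        else:        # no topical personalization: sample from f
            def weight(i, d):
                return mu * (i - rank[d])
        results.append((z, sample_staged(rng, sigma, weight)))
    return tau, eta, results


rng = random.Random(7)
theta = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [0.1, 0.9]}
tau, eta, results = sample_personalization_block(
    rng, vanilla_lists=[["a", "b", "c"]] * 4, theta=theta, num_topics=2)
```

Each element of `results` is a pair of the switch value and the sampled personalized list for one query.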

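The joint distribution for the personalization block can be
written as

$$p(\pi, z, \tau, \eta \mid \theta; \gamma, \delta, \mu, \lambda, \sigma) = p(\eta \mid \gamma) \cdot p(\tau \mid \delta) \cdot \prod_{i=1}^{m} p(z_i \mid \tau) \cdot g(\pi_i \mid \sigma_i, \lambda, \theta, \eta)^{z_i} \cdot f(\pi_i \mid \sigma_i, \mu)^{1 - z_i} \qquad (3)$$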

Finally, the full joint distribution for LTP can be obtained
by multiplying Equations 1 and 3. We treat the
parameters $\nu, \alpha, \delta, \gamma$ as constants and do not
consider learning them. However, the parameters $\mu$ and $\lambda$
that control the permutation models need to be learned. We have
assumed a Gaussian prior on $\eta$. The role of this prior is to set
$\eta$ to zero when we do not observe any significant difference
between $\pi$ and $\sigma$, i.e. $\pi_i \approx \sigma_i$.

We first assume that $\lambda$ and $\mu$ are predefined constants
and describe the inference (LTP-INF) of the personalization
vector $\eta$ based on these values in Section 4.2. We will then
use LTP-INF to also estimate these parameters in a later section.

4.2 Inference of Personalization Vector 

The key inferential problem that we study in this work is
to obtain the posterior distribution on the latent variables,
i.e. to determine $p(\theta, K, \beta, z, \tau, \eta \mid \sigma; \lambda, \mu)$.
As with simpler topic models, the exact inference is intractable and
therefore, we resort to approximate inference techniques. Given
the non-conjugacy of $\pi$ and $\theta$, sampling based techniques
are unlikely to be efficient. In this paper, we propose a
variational approximation scheme. In variational
inference, one defines a family of simpler distributions over the
latent variables to approximate the true posterior
distribution. This family of distributions is indexed by additional
parameters (called variational parameters) which are tuned
so as to minimize the KL divergence with the true posterior.

Figure 3: Variational distribution used for inferring
personalization in LTP (panel: Personalization Block).

We first simplify the inference by breaking it into two
parts. For the first part, we ignore the dependency
between the topic and the personalization block. Therefore,
our strategy is to first infer the topics and use the inferred
topics and the topic-maps of the items to carry out inference
for the personalization block. This will simplify the
exposition greatly and the ideas that we develop here will carry
over naturally to the general case of inferring the blocks
jointly. We revisit the inference for the complete model in
Section 4.4. Inference for the topic block follows standard
techniques (see e.g. ^) and therefore, we omit the details 
here. For the rest of this sub-section we assume that the 
topics have been inferred and develop an inference scheme 
for the personalization block. 

For the personalization block, the key inferential
problem is to obtain the posterior distribution
$p(z, \tau, \eta \mid \sigma; \lambda, \mu)$.
This posterior is approximated with the help of a variational
distribution $r$. Figure 3 illustrates its graphical model
representation. The personalization vector $\eta$ is assumed to be
Gaussian with the following density

$$r(\eta \mid \hat{\eta}) = (2\pi\gamma^2)^{-T/2} \exp\left(-\frac{(\eta - \hat{\eta})^T (\eta - \hat{\eta})}{2\gamma^2}\right)$$
Here, the variational parameter $\hat{\eta}$ represents the mean of the
Gaussian and its variance is $\gamma^2 I$. For query $q_i$ we assume that
$z_i$ is sampled from a Bernoulli distribution with parameter
$\phi_i \in (0, 1)$. Finally, for user $u$, we assume that $\tau$ is
sampled from a beta distribution having the following density
function

$$r(\tau \mid \kappa_1, \kappa_2) = \frac{\Gamma(\kappa_1 + \kappa_2)}{\Gamma(\kappa_1)\,\Gamma(\kappa_2)} \, \tau^{\kappa_1 - 1} (1 - \tau)^{\kappa_2 - 1}$$

where the parameters $\kappa_1, \kappa_2 > 0$ and $\Gamma(x)$ is the Gamma
function. We use the notation $\Psi(x)$ for the digamma
function, which is defined as $\frac{d}{dx} \ln \Gamma(x)$.

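Algorithm 1 LTP-INF: Variational Inference Algorithm for LTP

 1: Input: Training data-set $(\pi, \sigma)_{1,2,\ldots,m}$; values for $\lambda, \gamma, \delta, \mu$;
 2: Output: Values $(\phi'_{1 \ldots m}, \kappa'_1, \kappa'_2, \hat{\eta}')$ that maximize $\Lambda$;
 3: Initialization: Randomly initialize to $(\phi^{(0)}_{1 \ldots m}, \kappa_1^{(0)}, \kappa_2^{(0)}, \hat{\eta}^{(0)})$
    such that $1 > \phi_j^{(0)} > 0$ and $\kappa_1^{(0)}, \kappa_2^{(0)} > 0$;
 4: $t \leftarrow 0$; $\Lambda^{(0)} \leftarrow \Lambda(\phi^{(0)}_{1 \ldots m}, \kappa_1^{(0)}, \kappa_2^{(0)}, \hat{\eta}^{(0)})$;
 5: while $\Lambda$ has not converged do
 6:   $t \leftarrow t + 1$;
 7:   ...
 8:   ...
 9:   for $j = 1 \ldots m$ do
10:     ... $E_r[\ln g(\pi_j \mid \eta; \sigma_j, \theta, \lambda)]$;
11:     $\phi_j^{(t)} \leftarrow 1/(1 + e^{\ldots})$; /* Update $\phi_j$ */
12:   end for
13:   $\hat{\eta}^{(t)} \leftarrow \arg\max_{\hat{\eta}} \Lambda(\phi^{(t)}_{1 \ldots m}, \kappa_1^{(t)}, \kappa_2^{(t)}, \hat{\eta})$; /* Use conjugate gradient to optimize this block */
14:   $\Lambda^{(t)} \leftarrow \Lambda(\phi^{(t)}_{1 \ldots m}, \kappa_1^{(t)}, \kappa_2^{(t)}, \hat{\eta}^{(t)})$;
15: end while
16: return $(\phi^{(t)}_{1 \ldots m}, \kappa_1^{(t)}, \kappa_2^{(t)}, \hat{\eta}^{(t)})$

The next step in our variational analysis is to learn the
particular value of the parameters $(\phi, \kappa_1, \kappa_2, \hat{\eta})$ that
minimizes the KL divergence between $r$ and the true posterior
$p$. It can be shown^10 that minimizing the KL divergence
has the same effect as maximizing the following objective
function,

\[ \Lambda(\phi, \kappa_1, \kappa_2, \hat{\eta}) = E_r[\ln p] + H(r) \qquad (4) \]

where $H(r)$ is the entropy and $E_r$ denotes expectation w.r.t.
the distribution $r$.

^10 Refer to jSj for the proof.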
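The equivalence between minimizing the KL divergence and maximizing $\Lambda$ follows from a standard identity. The following is a sketch, assuming that $E_r[\ln p]$ in Equation 4 denotes the expected log joint density of the observation $\pi$ and the latent variables $(z, \tau, \eta)$; conditioning on $\sigma$ and on the model parameters is suppressed for brevity.

```latex
\begin{align*}
\mathrm{KL}\big(r \,\|\, p(z,\tau,\eta \mid \pi)\big)
  &= E_r[\ln r] - E_r[\ln p(z,\tau,\eta \mid \pi)] \\
  &= -H(r) - E_r[\ln p(\pi, z, \tau, \eta)] + \ln p(\pi) \\
  &= \ln p(\pi) - \Lambda .
\end{align*}
```

Since $\ln p(\pi)$ does not depend on the variational parameters, decreasing the KL divergence is the same as increasing $\Lambda$. Because the KL divergence is non-negative, $\Lambda$ is also a lower bound on $\ln p(\pi)$.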

We use block coordinate-wise ascent to maximize the expression
in Equation 4. Intuitively, we perform fixed point
iterations by updating one block of parameters at a time,
keeping all other parameters fixed to their most recent values.
The update rules for the parameters $\phi_{1,2,\ldots,m}, \kappa_1, \kappa_2$ are obtained
by setting the partial derivatives of $\Lambda$ to zero. Due to our
choice of $r$, the update rules for $\phi, \kappa_1, \kappa_2$ are particularly
simple and have closed-form expressions.
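The block-wise scheme can be illustrated on a toy problem. The sketch below is not the LTP objective: it uses a hypothetical two-variable concave function `f` whose per-block maximizers are available in closed form, and the names `block_coordinate_ascent`, `update_x` and `update_y` are illustrative only.

```python
def block_coordinate_ascent(update_x, update_y, objective, x, y,
                            tol=1e-10, max_iter=1000):
    """Maximize objective(x, y) by updating one block at a time."""
    value = objective(x, y)
    for _ in range(max_iter):
        x = update_x(y)              # maximizer in x with y held fixed
        y = update_y(x)              # maximizer in y with x held fixed
        new_value = objective(x, y)
        if new_value - value < tol:  # exact block updates never decrease the value
            return x, y, new_value
        value = new_value
    return x, y, value


# Toy concave objective (an illustrative stand-in, not Equation 4).
def f(x, y):
    return -(x - 1.0) ** 2 - (y + 2.0) ** 2 - x * y


# Setting df/dx = 0 gives x = 1 - y/2; setting df/dy = 0 gives y = -2 - x/2.
x_opt, y_opt, f_opt = block_coordinate_ascent(
    update_x=lambda y: 1.0 - y / 2.0,
    update_y=lambda x: -2.0 - x / 2.0,
    objective=f, x=0.0, y=0.0)
```

Each pass solves one block exactly while the other is frozen, which mirrors the closed-form updates described above; for a concave objective such as this toy `f`, the iterates approach the joint maximizer.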

To maximize $\Lambda$ with respect to $\hat{\eta}$, we use the conjugate
gradient algorithm^11. The objective function for $\hat{\eta}$ can be
written as

\[ L(\hat{\eta}) = -\frac{1}{2\gamma^2} \hat{\eta}' \cdot \hat{\eta} + \sum_i (1 - \phi_i) \cdot E_r[\ln g(\pi_i \mid \sigma_i, \theta, \lambda)] \]

It can be proved that $L$ is concave (with respect to $\hat{\eta}$)
and therefore, using simple optimizers like conjugate gradient,
we will be able to obtain the global maximum [2].
Algorithm 1 summarizes the inference procedure. See Section
4.2.1 for the derivations.

4.2.1 Derivations 

We now outline the key steps in deriving the update equations.
Our first goal is to obtain the expressions for the
entropy and expectation terms (Equation 4).

Entropy of $z$. The Bernoulli variate $z$ has a simple
entropy expression given by
$-\sum_{i=1}^{m} [\phi_i \ln \phi_i + (1 - \phi_i) \ln(1 - \phi_i)]$.

^11 http://en.wikipedia.org/wiki/Nonlinear_conjugate_gradient_method

Entropy of $\tau$. The entropy expression for the beta variate
$\tau$ is well-known^12 and given by the following expression,

\[ \ln\Gamma(\kappa_1) + \ln\Gamma(\kappa_2) - \ln\Gamma(\kappa_1 + \kappa_2) - (\kappa_1 - 1)\Psi(\kappa_1) - (\kappa_2 - 1)\Psi(\kappa_2) + (\kappa_1 + \kappa_2 - 2)\Psi(\kappa_1 + \kappa_2) \]
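The beta entropy expression can be evaluated with the standard library alone. The sketch below is illustrative: Python's `math` module has no digamma function, so `digamma` is a small hand-written approximation (recurrence plus an asymptotic series), and `beta_entropy` is a hypothetical helper name.

```python
import math


def digamma(x):
    """Approximate Psi(x) = d/dx ln Gamma(x) for x > 0."""
    result = 0.0
    while x < 6.0:                 # recurrence: Psi(x) = Psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return (result + math.log(x) - 0.5 / x
            - inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0)))


def beta_entropy(k1, k2):
    """Differential entropy of a Beta(k1, k2) distribution."""
    log_beta = math.lgamma(k1) + math.lgamma(k2) - math.lgamma(k1 + k2)
    return (log_beta
            - (k1 - 1.0) * digamma(k1)
            - (k2 - 1.0) * digamma(k2)
            + (k1 + k2 - 2.0) * digamma(k1 + k2))


uniform_entropy = beta_entropy(1.0, 1.0)   # Beta(1, 1) is uniform on (0, 1)
peaked_entropy = beta_entropy(2.0, 2.0)    # more concentrated, so lower entropy
```

As a sanity check, the uniform case gives zero entropy, and concentrating the density around 1/2 makes the entropy negative.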

Entropy of $\eta$. The entropy of a Gaussian is a function
of the covariance matrix only, which is $\gamma^2 I$ and therefore a
constant.

Deriving $E_r[\ln p(z_i \mid \tau)]$. This expression requires us to
determine $E_r[\ln \tau]$, which can be obtained using the technique
outlined in Blei et al. Finally, the expression is
given by

\[ \phi_i (\Psi(\kappa_1) - \Psi(\kappa_1 + \kappa_2)) + (1 - \phi_i)(\Psi(\kappa_2) - \Psi(\kappa_1 + \kappa_2)) \]

Deriving $E_r[\ln p(\tau \mid \delta)]$. This derivation is similar to the
last one and is given by

\[ (\delta - 1)(\Psi(\kappa_1) + \Psi(\kappa_2) - 2\Psi(\kappa_1 + \kappa_2)) \]

Deriving $E_r[\ln p(\eta \mid \gamma)]$. This can be derived using standard
Gaussian identities^13. The expression is simply given
by $-\frac{1}{2\gamma^2} \hat{\eta}' \cdot \hat{\eta}$.

Deriving $E_r[\ln g(\pi_i \mid \sigma_i, \eta)]$. This derivation is more subtle.
First observe that the expression is of the form
$\ln(e^{x_1'\eta} + e^{x_2'\eta} + \ldots)$. This expression is in general unwieldy as the
exponential terms appear inside the logarithm. We use a
standard trick of simplifying this form in the following way,



\[ \ln(e^{x_1'\eta} + e^{x_2'\eta} + \ldots) \le \zeta^{-1} (e^{x_1'\eta} + e^{x_2'\eta} + \ldots) + \ln\zeta - 1 \]

where $\zeta$ is an additional variational parameter. Observe that
the inequality holds for every $\zeta > 0$ and equality is attained
only for $\zeta = e^{x_1'\eta} + e^{x_2'\eta} + \ldots$.

We now have to deal with the expectation term $E_r[e^{x'\eta}]$.
Observe that in the variational model, we have assumed the
components $\eta_k$ to be independent (conditioned on $\gamma$) and
therefore, this expression is equivalent to

\[ E_r[e^{x'\eta}] = E_r\Big[\prod_k e^{x_k \eta_k}\Big] = \prod_k E_r[e^{x_k \eta_k}] \]

The expectation term can be derived using the moment generating
function of the Gaussian distribution and evaluates to
$E_r[e^{x_k \eta_k}] = \exp(x_k \hat{\eta}_k + \frac{1}{2}\gamma^2 x_k^2)$.

The expression for the ELBO can be obtained by summing up
all the entropy and expectation terms. Finally, the update
equations are derived by setting the partial derivatives of
the ELBO to zero for each block.

4.3 Parameter Estimation

We now focus our attention on learning $\lambda$ and $\mu$. We use
Maximum Likelihood Estimators (MLE) for this, where one
finds the value of the parameters that maximizes the (log)
likelihood of the observed data, i.e. the following expression

\[ \ln p(\pi \mid \lambda, \mu; \sigma) = \sum_i \ln p(\pi_i \mid \lambda, \mu; \sigma_i) \qquad (5) \]

However, to calculate the likelihood function, we have to
marginalize over the latent variables, which is difficult in our
model: for the real variables $(\eta, \tau)$ it leads to integrals
that are analytically intractable, and for the discrete variables
$(z_{1 \ldots m})$ it involves a computationally expensive sum over an
exponential (i.e. $2^m$) number of terms.

We use the variational Expectation Maximization (EM)
algorithm to circumvent this difficulty. In the E-step,
Algorithm 1 approximates the true posterior distribution over
the latent variables, using the current estimates of the
parameters. The variational parameters learned in this step
are used in the subsequent M-step to maximize the likelihood
function (over the true parameters $\lambda$ and $\mu$).

Algorithm 2 summarizes the steps of the variational EM.
It can be shown (see Section 4.2.1) that the constrained
maximization problem in step 6 is a concave program and
therefore can be solved optimally and efficiently [2].

Algorithm 2 LTP-EM: Variational EM Algorithm for LTP

 1: Input: Training data-set $(\pi, \sigma)_{1,2,\ldots,m}$;
 2: Output: Values $(\lambda', \mu')$ that maximize Equation 5;
 3: Initialization: Randomly initialize $(\lambda^{(0)}, \mu^{(0)})$ s.t. $0 < \lambda^{(0)} < 1$ and $\mu^{(0)} > 0$;
 4: while $(\lambda, \mu)$ have not converged do
 5:   E-step /* The variational inference step */
        $(\phi'_{1 \ldots m}, \kappa'_1, \kappa'_2, \hat{\eta}') \leftarrow$ LTP-INF$(\sigma, \pi, \lambda^{(i)}, \mu^{(i)})$;
        $\Lambda^{(i)}(\lambda, \mu) \leftarrow E_{r(\phi', \kappa'_1, \kappa'_2, \hat{\eta}')}[\ln p]$;
 6:   M-step /* Learn new estimates of the parameters */
        $(\lambda^{(i+1)}, \mu^{(i+1)}) \leftarrow \arg\max_{\mu > 0,\; 1 > \lambda > 0} \Lambda^{(i)}(\lambda, \mu)$;
 7:   $i \leftarrow i + 1$
 8: end while
 9: return $(\lambda^{(i)}, \mu^{(i)})$

^12 See http://en.wikipedia.org/wiki/Beta_distribution

^13 See www.cs.nyu.edu/~roweis/notes/gaussid.pdf



4.4 Learning Topic Distributions 

For inference in the topic block (Figure 1), we augment
our variational distribution with additional parameters in
the following way. Topic distribution $\beta_k$ is sampled from
a Dirichlet prior with parameters $\{\hat{\beta}_{k,w} \mid w \in V\}$. The
topic assignments $k_{i,j}$ are sampled from a multinomial
distribution with parameters $\omega_{i,j,1 \ldots T}$ and $\theta_i$ is sampled from
a normal distribution with mean $\hat{\theta}_i$ and variance $\sigma^2 I$.
Using the same recipe as in Section 4.2 (c.f. Equation 4), we
arrive at the following simple update rule for learning the
topic distributions



E 



The topic assignments $\omega_{i,j}$ also have a closed-form update
rule, given by $\omega_{i,j,k} \propto \exp(E_r[\ln \theta_{i,k}] + E_r[\ln \beta_{k,w_{i,j}}])$.

Learning the topic-maps of the urls (i.e. the $\theta_i$'s) is more subtle.
The main difficulty stems from the coupling between
the personalization and the topic blocks through $\theta$. While
determining $E_r[\ln g(\pi \mid \eta, \theta; \sigma, \lambda)]$ (step 8 of Algorithm 1),
we now have to take the expectation over $\theta$, in addition to $\eta$.
Specifically, we have to compute an expectation of the form
$E_r[\exp(\lambda\, \eta' \cdot \theta)]$, which is however tractable due
to our assumption of independence and Gaussian priors on $\theta$
and $\eta$. We use gradient descent on $\theta$ to solve it. The rest of
the calculation remains unchanged.
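Two ingredients recur in these expectations: the bound $\ln s \le s/\zeta + \ln\zeta - 1$ that moves a sum of exponentials out of the logarithm (Section 4.2.1), and the closed-form expectation of an exponential of independent Gaussian variables. The sketch below checks both numerically on made-up numbers; `log_sum_upper_bound` and `gaussian_mgf` are illustrative names, and the vectors `x` and `mean` are arbitrary test values rather than quantities from the model.

```python
import math
import random


def log_sum_upper_bound(terms, zeta):
    """Upper bound on ln(sum(terms)): sum/zeta + ln(zeta) - 1, for zeta > 0."""
    return sum(terms) / zeta + math.log(zeta) - 1.0


def gaussian_mgf(x, mean, variance):
    """E[exp(x . eta)] for eta ~ N(mean, variance * I), components independent."""
    return math.exp(sum(xk * mk + 0.5 * variance * xk * xk
                        for xk, mk in zip(x, mean)))


# The bound is loose for an arbitrary zeta and tight when zeta equals the sum.
terms = [math.exp(0.3), math.exp(-1.2), math.exp(2.0)]
exact = math.log(sum(terms))
loose = log_sum_upper_bound(terms, zeta=1.0)
tight = log_sum_upper_bound(terms, zeta=sum(terms))

# Monte Carlo estimate of E[exp(x . eta)] against the closed form.
rng = random.Random(0)
x, mean, variance = [0.5, -0.2], [1.0, 2.0], 0.25
n_samples = 100_000
std = math.sqrt(variance)
mc_estimate = sum(
    math.exp(sum(xk * rng.gauss(mk, std) for xk, mk in zip(x, mean)))
    for _ in range(n_samples)) / n_samples
closed_form = gaussian_mgf(x, mean, variance)
```

Because the closed form factorizes over components, the expectation stays cheap to evaluate even when the exponent involves many topics.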

5. EXPERIMENTS 

In this section, we describe a comprehensive set of exper- 
iments designed to evaluate the accuracy and effectiveness 
of our techniques. 

5.1 Datasets 

The input to our algorithm consists of a set of queries and
the personalized and vanilla results (i.e. $(\pi, \sigma)$ pairs) for them,
returned by a search engine. During the training phase, we
present these queries to LTP and let it learn the personalization
vector $\eta$. Once $\eta$ is learned, the next step is to validate
it by measuring how well it corresponds to the ground truth.
However, in practice, such validation schemes are often difficult
to design as the search engines do not reveal the actual
user profiles^14. We therefore perform our experiments on
both a real-world dataset comprising the Google search history
of a few users, and a large-scale synthetic dataset.

5.1.1 Google Search Personalization 

We collected search result and history data^15 from 10 real
Google users. This data collection was done as part of a
larger survey to understand the topic-level personalization
and privacy concerns of users, and is part of an ongoing
initiative to build a privacy evaluation toolkit (see Section 1.2).
Of this larger group, due to privacy concerns, only 10
participants volunteered to share their search history.

For these 10 users, we fetched their entire history of search
queries. The average number of (distinct) search queries
was 872. We issued each query to Google both with and
without the user's login credentials to retrieve the search
results. We used the Mallet [18] toolkit to extract topics
from the entire collection of urls^16 returned for all queries,
for each user.

We found ample evidence of profile-based personalization
on Google. Even when the personalized and vanilla
queries were performed with identical parameters, such as
location and IP address (same machine), user-agent, other
http-connection settings, etc., roughly 30% of queries received
personalized results. We also found that the personalization is
much more subtle compared to the impression we get from
the search personalization literature (and our experiments with
the AlterEgo server): most queries (~70%) were not personalized
and, while there were some queries with a fair amount
of personalization, on average we observed very little
difference between the results^17.

5.1.2 AlterEgo 



^14 Google, however, publishes the categories of topics used
to serve personalized ads. Unfortunately, this data is not
quite helpful as the categories are very high level and do not
convey rich enough information.

^15 http://history.google.com/history

^16 We used the snippets that Google returns along with
the search results to obtain text for the urls.

^17 The avg. EMD (earth mover's distance) over queries
with personalization was 5.9 (e.g. the EMD of moving a
single url at rank 5 to rank 1 is 4).



We use an open source search personalization engine called 
AlterEgo [Tt] to generate the synthetic dataset. AlterEgo 
contains implementations of various popular profiling and
personalization techniques; we used their "unique matching"
technique for our experiments^18. In our simulation, we
used AlterEgo as a surrogate personalization engine, i.e. we
obtain the vanilla result from Google and use AlterEgo to
personalize it. The benefit of this approach is that we can
train AlterEgo on topics of our choice and use this information
to validate the model output $\eta$. The work-flow and
details of the data generation steps are presented below.
Generating Topics. We extracted a set of 500 topics by
running Mallet on approximately 420k urls obtained from
the Delicious dataset [27]. We manually selected 50 topics and
labeled them into 10 categories (examples are health, cooking,
science, finance, etc.); these topics serve as the ground truth
for us. The selection of these topic categories and urls (used
in the next step) is intended to simulate typical user behavior,
where a user is interested in ~10 categories of
topics.

Training AlterEgo. For each topic, we inspect the topic-maps
of the urls and identify the ones which have significant
(> 0.2) weight on this topic. These urls are used to train
an AlterEgo profile. We generated 10 profiles trained on a subset
of 1 to 10 topics (i.e. 10 profiles for 1 topic, 10 profiles on
2 randomly selected topics, and so on), generating a total of
50 profiles.

Queries. We generated 500 queries for each topic by randomly
combining the top 10 relevant words from it. This
gives us a total of 5k queries (over 10 categories). For each
query, we retrieved the vanilla results from Google. Note
that AlterEgo will be able to personalize a query only if it
is related to a topic used for training the profile.
Otherwise, the vanilla and personalized results will be more
or less identical.
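The url selection and query generation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy topic-maps, the word list, and the query length range (2 to 4 words) are hypothetical, since the text only specifies the 0.2 weight threshold and that queries combine a topic's top 10 words at random.

```python
import random


def select_training_urls(topic_maps, topic, threshold=0.2):
    """Return urls whose weight on `topic` exceeds the threshold."""
    return [url for url, weights in topic_maps.items()
            if weights[topic] > threshold]


def generate_queries(top_words, n_queries, rng, min_len=2, max_len=4):
    """Build queries by randomly combining a topic's top words."""
    queries = []
    for _ in range(n_queries):
        length = rng.randint(min_len, max_len)  # query length is an assumption
        queries.append(" ".join(rng.sample(top_words, length)))
    return queries


# Hypothetical topic-maps: url -> weights over two topics.
topic_maps = {"u1": [0.7, 0.3], "u2": [0.1, 0.9], "u3": [0.25, 0.75]}
training_urls = select_training_urls(topic_maps, topic=0)

# Hypothetical top-10 words for a "health" topic.
health_words = ["health", "diet", "exercise", "doctor", "vitamin",
                "sleep", "fitness", "nutrition", "weight", "symptom"]
queries = generate_queries(health_words, 500, random.Random(42))
```

Seeding the generator keeps the synthetic query set reproducible across runs.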

5.2 Implementation Details 

We implemented Algorithms 1 and 2 in the Java programming
language. For solving the convex program in Algorithm 2
(step 6), we use JOptimizer fol, a Java-based open
source optimization package. All our experiments are carried
out on an Intel Pentium IV machine with a 3.0GHz processor
and 4GB of RAM.

We use the following values of the hyperparameters:
$\delta = 2.0$, $\gamma = 1.0$. For computational efficiency, we used Mallet
for inference in the topic block (see Figure 1) and do not use
the inference process described in Section 4.4.



5.3 Results with the AlterEgo data-set 

In this section, we summarize the result of our experi- 
ments with the AlterEgo data-set. 

5. 3. 1 Precision-Recall 

Our first set of experiments is designed to evaluate the
accuracy of LTP in correctly learning the personalized topics.
On each AlterEgo profile, we train LTP and learn the
personalization vector $\eta$. Next we compare it with the actual
list of topics that were used to train this profile (by
AlterEgo). Let $T_{act}$ be the true set of personalized topics
and $T_{inf}$ be the one inferred by LTP. For this experiment,
we measure the precision and recall values, where precision
is defined as $|T_{act} \cap T_{inf}| / |T_{inf}|$, i.e. the fraction of reported topics
that are actually personalized, and recall as
$|T_{act} \cap T_{inf}| / |T_{act}|$, i.e.
the fraction of the original personalized topics that we are
able to identify.

^18 We also did experiments with their "matching" technique,
and got very similar results which are omitted due
to lack of space.



P@1     P@3     P@5     R-pre   P@+1    P@+3    MAP
97.80   84.02   70.60   84.66   70.69   54.44   97.60

Table 1: Performance (in %) of LTP in finding personalized topics.

We re-order the topics based on the (decreasing) value of
$\eta$ computed by LTP. For each $k$, we declare the top-$k$ topics
(with maximum $\eta$ values) as personalized and calculate the
precision and recall values for this decision. Table 1 summarizes
the precision scores obtained by LTP. Specifically,
we evaluate its performance in terms of Precision@1 (P@1),
P@3, P@5, R-precision (R-pre) and mean average precision
(MAP) [5] [7]. Note that the number of actual topics was quite
different for different runs (it varies from 1 to 10). Hence, along
with the top-$k$ topics, we also study the precision at $|T_{act}| + k$
(denoted as P@+k).

Figure 4: Precision-Recall results for LTP in retrieving
the personalized topics.
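The ranking metrics used in Table 1 can be computed from a learned vector as sketched below. The topic names and $\eta$ values are made up for illustration, and the function names are hypothetical; MAP is the mean of `average_precision` over profiles.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked topics that are truly personalized."""
    return sum(1 for topic in ranked[:k] if topic in relevant) / k


def r_precision(ranked, relevant):
    """Precision at k = |T_act|."""
    return precision_at_k(ranked, relevant, len(relevant))


def precision_at_plus_k(ranked, relevant, k):
    """P@+k: precision at |T_act| + k."""
    return precision_at_k(ranked, relevant, len(relevant) + k)


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant topic."""
    hits, total = 0, 0.0
    for rank, topic in enumerate(ranked, start=1):
        if topic in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical learned eta values and ground-truth personalized topics.
eta = {"health": 0.9, "finance": 0.1, "cooking": 0.7,
       "science": 0.4, "sports": 0.05}
t_act = {"health", "science"}

ranked = sorted(eta, key=eta.get, reverse=True)  # decreasing eta
p1 = precision_at_k(ranked, t_act, 1)
p3 = precision_at_k(ranked, t_act, 3)
r_pre = r_precision(ranked, t_act)
p_plus_1 = precision_at_plus_k(ranked, t_act, 1)
ap = average_precision(ranked, t_act)
```

In this toy ranking (health, cooking, science, finance, sports) the two true topics sit at ranks 1 and 3, so P@1 is 1, R-precision is 1/2, and the average precision is (1/1 + 2/3)/2.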

In Figure 4 we illustrate the recall performance of our
algorithm. At the expense of low precision (< 0.4), LTP is
able to retrieve nearly all the personalized topics (recall > 0.93)
and its recall performance is relatively insensitive to precision;
however, if we require high precision (> 0.8), the recall
drops to ~0.5. As evident from the figure, a typical operating
characteristic of LTP is precision ~0.7 and recall ~0.7,
which is achieved when we return the top-3 topics.

5.3.2 Classification Tests 

In this section, we develop two classification tests to evaluate
LTP's predictive power. For both these experiments, we
randomly split the $(\pi, \sigma)$ list into data-sets D1 (80%), used
for training LTP, and D2 (20%), used for testing. We repeat
this split with 10 random seeds and report the average
in all the data presented below.
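The evaluation protocol (an 80/20 split repeated over 10 seeds, then averaged) can be sketched as below. The helper names are hypothetical, and `evaluate` stands in for any routine that trains on D1 and returns a score on D2.

```python
import random


def split_pairs(pairs, train_fraction, seed):
    """Shuffle with a fixed seed and split into (D1, D2)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def average_over_seeds(pairs, evaluate, seeds=range(10), train_fraction=0.8):
    """Average evaluate(D1, D2) over several random splits."""
    scores = []
    for seed in seeds:
        d1, d2 = split_pairs(pairs, train_fraction, seed)
        scores.append(evaluate(d1, d2))
    return sum(scores) / len(scores)


# Placeholder data: integers stand in for (pi, sigma) pairs.
pairs = list(range(100))
d1, d2 = split_pairs(pairs, 0.8, seed=0)
test_share = average_over_seeds(
    pairs, evaluate=lambda train, test: len(test) / (len(train) + len(test)))
```

Fixing the seed per split makes each of the 10 runs reproducible while still varying which pairs land in the test set.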

Query Disambiguation In this experiment, while test- 
ing on D2, we hide which result is personalized and which 
one is vanilla and the task of the model is to determine the 
correct labels. 

We proceed with the classification task in the following
way. Let $\eta'$ be the parameter learned by LTP during the
training. For input lists $l_1$ and $l_2$, LTP calculates the
likelihood values $p(l_1 \mid l_2, \eta')$ and $p(l_2 \mid l_1, \eta')$ and whichever
likelihood is higher is assigned to the personalized result, i.e.
if $p(l_1 \mid l_2, \eta') > p(l_2 \mid l_1, \eta')$ then $l_1$ is declared to be the
personalized result and vice versa. We name this test P-V
disambiguation for a given profile. Over all the test points
in D2, the fraction of queries that were labeled correctly is
referred to as disambiguation accuracy.

#topics   Accuracy ($\mu \pm \sigma$)      Time (secs)
          LTP-EM      LTP-INF      LTP-EM   LTP-INF
1         .74 ± .09   .72 ± .09    80.7     22.7
2         .72 ± .06   .70 ± .09    154.3    31.5
3         .70 ± .05   .68 ± .06    221.6    42.4
4         .69 ± .04   .67 ± .05    272.2    53.7
5         .69 ± .05   .67 ± .05    336.1    69.8
6         .67 ± .04   .65 ± .05    333.2    70.7
7         .65 ± .04   .65 ± .05    342.5    71.1
8         .63 ± .04   .63 ± .04    348.2    73.6
9         .63 ± .05   .62 ± .05    354.4    76.4
10        .62 ± .02   .62 ± .02    359.2    79.5

Table 2: Summary of results with the AlterEgo dataset

Table 2 summarizes the result of this experiment. In summary,
we achieve disambiguation accuracy in the range of
62-74%. For each profile, we collect the accuracy values for
the 10 different runs and report their mean and standard
deviation ($\mu \pm \sigma$). Observe that our accuracy decreases slightly
as the AlterEgo profile is trained with more and more topics.

Table 2 also reports the training time of LTP-EM. For
profiles trained with many topics, LTP-EM takes more time
to converge. We repeat the experiment with LTP-INF with
the parameter values fixed to $\lambda = 0.9$ and $\mu = 10.0$. As the
results show, LTP-INF is up to 5 times faster to train but
achieves slightly lower accuracy. The accuracy, however,
improves slightly (< 3%) if we increase the amount of training
data (D1) from 80% to 90% (not shown in the table).
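The "up to 5 times faster" figure can be checked directly from the training times reported in Table 2. The snippet below only divides the two time columns (values copied from the table); the variable names are illustrative.

```python
# Training times in seconds from Table 2, for profiles with 1..10 topics.
em_secs = [80.7, 154.3, 221.6, 272.2, 336.1, 333.2, 342.5, 348.2, 354.4, 359.2]
inf_secs = [22.7, 31.5, 42.4, 53.7, 69.8, 70.7, 71.1, 73.6, 76.4, 79.5]

# Speedup of LTP-INF over LTP-EM for each number of topics.
speedups = [em / inf for em, inf in zip(em_secs, inf_secs)]
best_speedup = max(speedups)
best_topics = speedups.index(best_speedup) + 1   # topics are 1-indexed
worst_speedup = min(speedups)
```

The ratio ranges from roughly 3.6 (1 topic) to roughly 5.2 (3 topics), which is consistent with the rounded claim in the text.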

User Classification. For this experiment, we consider
groups of users (i.e. profiles) and develop a classification
test within the group members. We vary the size of the
group from 2 to 10 and for each group size, randomly pick
10 groups. For each group $G$, we present a $(\pi, \sigma)$ pair to
LTP but do not reveal the user it belongs to. The task of
the model is to correctly predict the user.

We again use the likelihood test for this task. Specifically,
for each user $u \in G$ and input $(\pi, \sigma)$, we calculate $p(\pi \mid \sigma, \eta'_u)$
($\eta'_u$ learned during training) and output the user for which
the likelihood attains its maximum value.
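Both likelihood tests reduce to comparing log-likelihoods under learned vectors. The sketch below illustrates the decision rules only. The scoring model is an assumption made for this example, not the paper's $g$: a toy Plackett-Luce likelihood in which a url's score is its (negated, scaled) vanilla rank plus $\eta \cdot \theta_{url}$. All names and numbers are hypothetical.

```python
import math


def log_likelihood(ranking, vanilla, eta, theta, lam):
    """Toy Plackett-Luce log-likelihood of `ranking` given the vanilla list.

    Assumed score (not the paper's model):
    -lam * vanilla_rank(url) + eta . theta[url].
    """
    vanilla_rank = {url: r for r, url in enumerate(vanilla)}

    def score(url):
        return (-lam * vanilla_rank[url]
                + sum(e * t for e, t in zip(eta, theta[url])))

    total = 0.0
    remaining = list(ranking)
    while remaining:
        scores = [score(u) for u in remaining]
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[0] - log_norm   # remaining[0] is placed next
        remaining.pop(0)
    return total


def pv_disambiguate(l1, l2, eta, theta, lam):
    """Return the list judged to be the personalized one."""
    if log_likelihood(l1, l2, eta, theta, lam) > log_likelihood(l2, l1, eta, theta, lam):
        return l1
    return l2


def classify_user(pi, sigma, profiles, theta, lam):
    """Return the user whose eta gives (pi, sigma) the highest likelihood."""
    return max(profiles,
               key=lambda u: log_likelihood(pi, sigma, profiles[u], theta, lam))


theta = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.5, 0.5]}  # url topic-maps
vanilla = ["a", "b", "c"]
personalized = ["b", "a", "c"]          # url "b" (topic 2) was promoted
profiles = {"alice": [0.0, 3.0], "bob": [3.0, 0.0]}

guess = pv_disambiguate(personalized, vanilla, profiles["alice"], theta, lam=1.0)
user = classify_user(personalized, vanilla, profiles, theta, lam=1.0)
```

In this toy setup the list that promotes the topic-2 url is the one judged personalized, and it is attributed to the profile with the large weight on topic 2.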

In Figure 5 we summarize the result of this experiment.
There are two parameters in this experiment: the size of
the group and the number of topics used to train AlterEgo
for each profile in the group. For simplicity, we present here
results for the homogeneous case, where we combine profiles
which are trained on the same number of topics^19. Observe
that the accuracy reported by LTP is significantly higher
than a random guess (which is $1/g$, $g$ being the group size).
The accuracy decreases slightly if profiles are trained with
many topics. We believe this reduction in accuracy is also an
artifact of our data generation: profiles trained on multiple
topics can (and do) have topics in common, which makes
it hard to distinguish the personalized responses of two profiles
trained on the same topic.

In summary, these results, together with the precision-recall
values from the last section, highlight that our model fits
the data well and learns the correct set of personalized topics
on synthetic data.

^19 We also performed experiments on the general case (e.g.
by grouping profiles trained on 3 topics with 5 topics). The
results are similar and not repeated here.

Figure 5: Performance of LTP in user classification
(x-axis: number of users in a group).

5.4 Results with the Google dataset 

In this section we describe the results with the Google
dataset. Note that since we do not know the actual
personalization on different topics (ground truth) for a real Google
user, we cannot perform the precision-recall experiments as
with the AlterEgo dataset, and resort to only the query
disambiguation and user classification tests described above. However,
we also perform some qualitative tests that give ample
indication that we have found a good personalization vector.

5.4.1 Qualitative Evidences for Correctness of $\eta$

We now present our analysis on finding qualitative correctness
of $\eta$ using evidences of personalization. An evidence is
an instance of $(\pi, \sigma)$ where results were re-ranked such that
the ones on topics with high $\eta$ were moved up. Note that while
such evidences have no statistical significance, they are much more
helpful for a user's understanding of his profile compared to
the personalization vector. Such evidences are a core feature
of the privacy toolkit we are building (see Section 1.2).

Figure 6 shows an example evidence of personalization
happening on a user's account. The result for query Q ("how
to decide mixing of markov chain") and theta values for two
relevant topics T1 (about "Algorithms", defined by words
algorithm, design, complexity) and T2 (about "Probability",
defined by words probability, distribution) are shown. For
this user, the $\eta$ value for T1 is very high compared to T2.
Observe that the wiki link U1 (in the box), although less
relevant to the query, is placed higher in the personalized results.
As our analysis shows, U1 has a high weight on topic T1
compared to U2, which leads to this personalization. The
user can therefore see not just his inferred interests (more
in "Algorithms" compared to "Probability"), but also how they
affect his results.

We next move to another qualitative analysis of $\eta$ by
comparing it directly with the categories Google itself associates
with a user^20. We try to match topics with high $\eta$ (top-$k$
such topics) with the broad categories in Google. Table 3
shows the result of such matching for 3 users. Take, for
example, the "Anime and Manga" category, which was also
assigned a very high $\eta = .6$ (compared to an average value
of .004) by LTP.

Google Category                        Topic in LTP                                 $\eta$
Comics & Animation - Anime & Manga     online read manga kyojin shingeki chapter    0.60
Autos & Vehicles - Vehicle Shopping    car India chrysler price jaguar sport bmw    0.42
Computers - Software Utilities         class import common org public implement     0.15
World Localities - South Asia          Seoul citi hotel location shop mall coex     0.13

Table 3: Correlation between personalized topics in
LTP and Google categories.

Such anecdotes show that our techniques have, in fact,
learned the personalization vector correctly.

5.4.2 Quantitative Experiments

Query Disambiguation. Table 4 summarizes the result
of query disambiguation on the Google dataset. We first
study the effects of the number of topics (T) chosen for the
user. We notice that only a few topics (15-50) are enough
to get good accuracy for any user. Our accuracy results
differ significantly for different users, varying from as low
as 54% to 85%. We believe this is because the amount of
personalization is different for various users, and this affects
the learning accuracy of our techniques.

User Id   15 Topics   20 Topics   50 Topics   100 Topics
1         74±5        70±5        70±6        73±4
2         68±5        70±4        70±4        65±3
3         67±13       72±14       67±13       73±11
4         54±8        51±6        55±6        59±7
5         54±11       47±9        49±11       43±9
6         85±7        78±5        84±4        81±7
7         73±4        70±5        71±6        73±6
8         66±3        62±3        61±4        64±3
9         52±4        52±3        50±4        54±4

Table 4: Accuracy (in %) of LTP over 9 Google users.

User Classification. Table 5 shows that even with 3 users,
we are able to get an accuracy of up to 60%. For this
experiment, we extracted $\eta$ values over a common set of topics
for each user. The learned $\eta$ values were also very different
for different users (data not shown). This shows that $\eta$ is in
fact tailored to the personalization of each user.

6. CONCLUSIONS 

In this paper we have presented a novel approach to extract
user profile information in the form of a personalization
vector over topics from commercial OSPs (such as Google
search). Our approach treats OSPs as black-boxes, i.e.
assumes no knowledge of the personalization algorithms and
history of users maintained by them, and works by comparing
the personalized and vanilla content served by them.

Figure 6: An example to illustrate the difference between personalized (left) and vanilla (right) search results
(for a real user) returned by Google.

To the best of our knowledge, this is the first work that
tries to extract information based solely on mining the output
of OSPs. This aspect of our work makes it unique and
is beneficial in not just enabling access to (so far hidden)
profiles in OSPs, but also in providing a novel and practical
approach for retrieving information from OSPs by mining
differences in their outputs.

Our approach also has direct benefits for end users, as it,
for the first time, enables them to access their (so far hidden)
profile information tracked by an OSP. While being an
informational tool by itself, this has wider implications for the
outlook of user privacy research: it can be used to infer the
personalization happening on sensitive topics (e.g. financial,
medical history, etc.), which a user may not be comfortable
with. We believe that this can be used to build an end-user
privacy-preserving tool and are currently working on a
prototype for the same.
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